Innovative Data Exploration with LASSO: Unveiling Patterns in Earnings and Education

Muhammad Usman Aslam
Sai Kumar Miryala
Nikhitha Amireddy
Ishrath Jahan

2024-04-22

What is LASSO Regression?

LASSO (Least Absolute Shrinkage and Selection Operator) was introduced by Robert Tibshirani in 1996 [@Tibshirani1996].

LASSO regression, also known as L1 regularization, is a popular technique used in statistical modeling and machine learning to estimate the relationships between variables and make predictions.

The primary goal of LASSO is to shrink some coefficients to exactly zero, effectively performing variable selection by excluding irrelevant predictors from the model. This helps strike a balance between model simplicity and accuracy.

Applications Across Fields

LASSO regression’s versatility across multiple fields illustrates its capability to manage complex datasets effectively, particularly with continuous outcomes.

Zhou et al. [@Zhou2022] highlighted LASSO’s ability to identify key economic predictors that assist in strategic decision-making.

This example underscores its utility in economic analysis, where it helps to isolate factors that directly influence continuous economic outcomes like wages, prices, or economic growth.

Lu et al. and Musoro [@Lu2011; @Musoro2014] used LASSO regression to develop models based on gene expression data, advancing our understanding of genetic influences on continuous traits and diseases. Their work illustrates how LASSO can handle vast amounts of biological data to pinpoint critical genetic pathways.

McEligot et al. [@McEligot2020] employed logistic LASSO to explore how dietary factors, which vary continuously, affect the risk of developing breast cancer. Their findings highlight LASSO’s strength in dealing with complex, high-dimensional datasets in health sciences.

Advantages of LASSO Regression

LASSO regression is highly valued in fields ranging from healthcare to finance due to its ability to simplify complex models without sacrificing accuracy. This method’s key strengths include:

- Feature Selection: LASSO can set some coefficients exactly to zero, effectively choosing the most relevant variables from many possibilities. This automatic feature selection helps focus the model on the truly impactful factors. [@Park2008]

- Model Interpretability: By eliminating irrelevant variables, LASSO makes the resulting models easier to understand and communicate, enhancing their practical use. [@Belloni2013]

- Mitigation of Multicollinearity: LASSO addresses issues that arise when predictor variables are highly correlated. It selects one variable from a group of closely related variables, which simplifies the model and avoids redundancy. [@Efron2004]
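The multicollinearity behavior above can be illustrated with a minimal sketch on simulated data (all variables and values below are made up for illustration): when two predictors are nearly identical, LASSO tends to keep one and shrink the other to zero.

```r
library(glmnet)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is nearly a copy of x1
x3 <- rnorm(n)
y  <- 2 * x1 + x3 + rnorm(n)

X   <- cbind(x1, x2, x3)
fit <- glmnet(X, y, alpha = 1)  # alpha = 1 selects the LASSO penalty

# At a moderate lambda, one of the correlated pair x1/x2 is
# typically shrunk exactly to zero while the other is retained.
coef(fit, s = 0.5)
```

An ordinary least squares fit on the same data would instead split the effect unstably between x1 and x2.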

Methodology Overview

LASSO enhances linear regression by adding a penalty on the size of the coefficients, aiding in feature selection and improving model interpretability.

LASSO’s objective function:

\[ \min_{\beta} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]

- Goal: Minimize the Residual Sum of Squares (RSS) with a penalty on the absolute values of the coefficients.

- Parameter λ: Balances model complexity against overfitting.

1. Beta Coefficients (\(\beta\))

- These are the parameters of the model, where \(\beta_0\) is the intercept and \(\beta_j\) are the coefficients of the predictors.

2. Observed Values (\(y_i\))

- These are the responses observed for each observation in the dataset.

3. Predictor Values (\(x_{ij}\))

- These are the values of the predictors for each observation.

4. Residual Sum of Squares (RSS)

- This measures the discrepancy between observed values and predictions, normalized by \(\frac{1}{2n}\) for computational convenience.

5. Regularization Parameter (\(\lambda\))

- This parameter controls the trade-off between fitting the model accurately and keeping the model coefficients small.

6. L1 Penalty

- This term encourages sparsity in the model by allowing some coefficients to shrink to zero.
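The components above can be assembled into a direct evaluation of the objective. The following sketch uses a small simulated design matrix (all values are illustrative, not from the RetSchool data):

```r
# Evaluate the LASSO objective for a given coefficient vector.
lasso_objective <- function(beta0, beta, X, y, lambda) {
  rss     <- sum((y - beta0 - X %*% beta)^2) / (2 * length(y))  # (1/2n) * RSS
  penalty <- lambda * sum(abs(beta))                            # L1 penalty (intercept excluded)
  rss + penalty
}

set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
y <- rnorm(10)

lasso_objective(beta0 = 0, beta = c(0.5, -0.2), X, y, lambda = 0.1)
```

Solvers such as glmnet minimize exactly this quantity over \(\beta_0\) and \(\beta\) for a grid of λ values.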

How Does LASSO Regression Work?

LASSO regression starts with the standard linear regression model, which assumes a linear relationship between the independent variables (features) and the dependent variable (target).

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon \]

where y is the dependent variable (target); β₀, β₁, β₂, …, βₚ are the coefficients (parameters) to be estimated; x₁, x₂, …, xₚ are the independent variables (features); and ε represents the error term.

LASSO regression introduces an additional penalty term based on the absolute values of the coefficients.

The choice of the regularization parameter λ is crucial in LASSO regression:

- At λ = 0, LASSO reduces to ordinary least squares regression, with no coefficient shrinkage.

- Variable Selection: As λ increases, more coefficients shrink to zero.

- Optimization: The optimal λ is found through cross-validation.

- Feature Selection: Reduces the coefficients of non-essential predictors to zero.

- Regularization: Enhances model generalizability, critical for complex datasets.

- Fields of Application: Finance and healthcare, where accurate prediction is crucial.

- Comparison with MLR: Demonstrates LASSO’s strength in handling high-dimensional data by selectively including only relevant variables.
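The role of λ described above can be sketched with glmnet on simulated data (variable names and values are illustrative):

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)

fit <- glmnet(X, y, alpha = 1)

# Coefficient paths versus log(lambda): each curve shrinks toward
# zero as the penalty grows.
plot(fit, xvar = "lambda", label = TRUE)

coef(fit, s = min(fit$lambda))  # small lambda: close to the OLS fit
coef(fit, s = max(fit$lambda))  # large lambda: coefficients driven to zero
```

Reading the path plot from right to left shows the order in which predictors enter the model as the penalty relaxes.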

Outline of the Project

Our project aims to explore the impact of various factors on wages using the RetSchool dataset, focusing on how education and demographic variables influence earnings in 1976. We have chosen LASSO regression to address our research questions due to its unique capabilities in dealing with complex datasets and its methodological strengths in feature selection and model accuracy.

Research Questions Addressed

  • What factors most significantly affect wages in 1976?

  • How do education and demographic variables influence wage disparities?

  • Can we predict wage outcomes based on these variables effectively using a simplified model?

Dataset Description

- Overview of RetSchool Dataset Variables:

Understanding the variables in the RetSchool dataset is crucial for analyzing socio-economic and educational influences on wages in 1976.

| Variable | Description | Type | Relevance |
|----------|-------------|------|-----------|
| wage76 | Wages of individuals in 1976 | Continuous | Primary measure of economic status |
| age76 | Age of individuals | Continuous | Analyzes age impact on wages |
| grade76 | Highest grade completed | Continuous | Indicates educational attainment |
| col4 | College education | Binary | Impact of higher education on wages |
| exp76 | Work experience | Continuous | Examines experience influence on wages |
| momdad14 | Lived with both parents at age 14 | Binary | Family structure’s impact on early life outcomes |
| sinmom14 | Lived with a single mother at age 14 | Binary | Focuses on single-mother household impact |
| daded | Father’s education level | Continuous | Paternal education impact on offspring’s outcomes |
| momed | Mother’s education level | Continuous | Maternal education impact |
| black | Racial identification as black | Binary | Used to analyze racial disparities |
| south76 | Residency in the South | Binary | For regional economic analysis |
| region | Geographic region | Categorical | Regional influences on outcomes |
| smsa76 | Urban residency | Binary | Urban versus rural disparities |

Data Exploration

Initial data cleaning included addressing missing values through imputation or removal to refine the dataset for detailed analysis.

Figure 1: Work Experience Distribution in 1976

- Visualization: The right-skewed distribution of exp76 suggests a young, less experienced workforce.

- Implications: Reflects entry-level workers predominating in 1976, impacting wage levels and economic conditions.

Figure 2: Wage Distribution in 1976

- Visualization: A histogram and density plot show most workers earned lower wages, with a minority earning significantly more.

- Economic Insights: Highlights income disparities and provides insights into the financial stability of the population.

Figure 3: Correlation Matrix of Selected Variables

- Analysis Tool: Visualizes relationships between key variables like wage76, grade76, exp76, and age76.

- Findings: Identifies strong predictors of wages and helps understand economic dynamics of the era.

Why LASSO for the RetSchool Dataset?

- Insight: LASSO’s automatic feature selection is pivotal in isolating significant predictors such as education level and regional differences, directly impacting wage analysis.

- Benefit: Simplifies the model by focusing only on impactful variables, enhancing interpretability, which is critical for formulating effective educational and economic policies.

- Challenge: Overlapping influences of educational attainment and work experience on wages could skew analytical results.

- Solution: By penalizing the coefficients of correlated predictors, LASSO yields a more stable and reliable model, addressing multicollinearity without manual intervention.

- Goal: To develop a statistically robust model that stakeholders can easily understand and use.

- Outcome: LASSO’s regularization promotes model simplicity and clarity, providing straightforward insights that are essential for policy-making and strategic educational planning.

- Technique: Incorporates k-fold cross-validation within the LASSO framework to tune the regularization parameter, optimizing model accuracy.

- Advantage: Enhances predictive reliability, crucial for accurately forecasting wage trends based on educational variables, and helps prevent overfitting.

- Analysis: Compared to traditional regression methods, LASSO effectively manages large datasets with many predictors.

- Result: Demonstrates superior capacity for feature selection and multicollinearity management, making it well suited to in-depth wage analysis in educational data.

- Variable wage76: A continuous variable, which benefits from LASSO’s ability to model continuous data without categorization.

- Importance: Ensures that the nuances and variations in wage data are accurately modeled, providing a deeper understanding of the economic factors at play.

Statistical Modeling

Proper data preparation is critical to ensure the robustness of the statistical analysis:

  • Handling Missing Data: Key variables with missing data, such as educational background and work experience, were imputed using the median of available data to minimize the impact of outliers.

  • Removing Incomplete Records: After imputation, records that still contained missing values were removed to maintain the integrity and accuracy of the model analysis.

Visual checks and plots were used to compare the distribution of variables before and after cleaning. These visualizations help confirm that the data cleaning process preserved the underlying structure of the data while improving the quality for analysis.
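The cleaning steps above can be sketched as follows. The variable names match the dataset, but the raw data frame name (df) and the exact imputation code are illustrative; the project's actual script may differ.

```r
library(dplyr)

# Median imputation for key numeric variables with missing values,
# e.g. educational background (grade76) and work experience (exp76).
df_clean <- df %>%
  mutate(
    grade76 = coalesce(grade76, median(grade76, na.rm = TRUE)),
    exp76   = coalesce(exp76,   median(exp76,   na.rm = TRUE))
  )

# After imputation, drop any records that still contain missing values.
df_clean <- df_clean[complete.cases(df_clean), ]
```

Medians are used rather than means so that the skewed wage and experience distributions do not let outliers distort the imputed values.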

Selection of Target and Predictor Variables

- Target Variable:

The primary variable of interest, wage76, represents the wages of individuals in 1976 and serves as the dependent variable in our LASSO model.

y <- df_clean$wage76

- Predictor Variables:

Variables selected based on their theoretical relevance to wage determination included education level (grade76, col4), work experience (exp76), and demographic factors (e.g., age, race, geographic location).

With the data now clean and the variables of interest identified, visualizing them can provide deeper insights into their distributions and relationships within the dataset. This helps in understanding the dynamics and potential influences on wages in 1976.

Figure 4: Visualizations of Key Variables

Feature Scaling

Effective feature scaling is essential before fitting the LASSO model to ensure each variable contributes equally to the analysis. This prevents any feature from disproportionately influencing the outcome due to scale variance.

- Standardization Process: All features are normalized to have zero mean and unit variance. This step is crucial for models that apply a penalty on the size of coefficients, such as LASSO.

library(dplyr)
library(caret)
library(glmnet)

# Select only numeric features and exclude the target variable 'wage76'
# (select() and where() come from dplyr)
numeric_features <- select(df_clean, where(is.numeric), -wage76)

# Convert the selected features into a matrix, as required by glmnet
features <- data.matrix(numeric_features)

# Center and scale each feature to zero mean and unit variance
preProcValues <- preProcess(features, method = c("center", "scale"))
features_scaled <- predict(preProcValues, features)

Figure 5: Distribution of Feature ‘exp76’ Before and After Scaling

Cross Validation for Optimal \(\lambda\)

Selecting the optimal regularization parameter, λ, is crucial for balancing the complexity and accuracy of the LASSO model.

Cross-validation is used to determine the λ that minimizes prediction error; validating the model across multiple data subsets ensures it performs well on unseen data.
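A sketch of this step with cv.glmnet, assuming features_scaled and y from the earlier preparation steps (the seed and fold count here are illustrative):

```r
library(glmnet)

set.seed(123)
cv_fit <- cv.glmnet(features_scaled, y, alpha = 1, nfolds = 10)

cv_fit$lambda.min  # lambda minimizing mean cross-validated error
cv_fit$lambda.1se  # largest lambda within one SE of the minimum (sparser model)

# Mean cross-validated error with error bars across the lambda grid.
plot(cv_fit)
```

Choosing lambda.1se instead of lambda.min trades a little prediction error for a model with fewer retained predictors.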

Figure 6: Cross-Validation Curve

Analyzing the coefficients after fitting the model with the optimal λ reveals which variables significantly influence the dependent variable.

  • Significance of Coefficients: Coefficients that remain significant (not shrunk to zero) are key predictors of wages.

  • Interpretation of Results: The size and direction of these coefficients provide insights into how each predictor affects wage levels.
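Extracting the surviving predictors can be sketched as follows, assuming cv_fit is a fitted cv.glmnet object from the cross-validation step:

```r
# Coefficients at the cross-validated optimal lambda.
coefs <- coef(cv_fit, s = "lambda.min")

# Keep only predictors with nonzero coefficients; the rest were
# shrunk exactly to zero and dropped from the model.
nonzero <- coefs[coefs[, 1] != 0, , drop = FALSE]
nonzero
```

The sign of each retained coefficient indicates the direction of its association with wages, and (because features were standardized) magnitudes are roughly comparable across predictors.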

Visual representations are used to effectively communicate the impact of significant predictors, providing a clear and intuitive understanding of the model’s outcomes.

  • Goals of Visualization: To present complex statistical results in a visually engaging and easily understandable format.
  • Insights Provided:
    • Scaling Importance: Confirms that the analysis is unbiased and equitable.
    • Lambda Optimization: Demonstrates the effectiveness of our model fitting process, ensuring a balance between simplicity and predictive accuracy.
    • Predictor Impact: Illustrates how specific factors like education and demographics influence wages, guiding policy and decision-making.

Results

Understanding the impact of coefficient differences between LASSO and MLR models offers deeper insights into the complexities of the dataset and the efficacy of regularization.

We analyze both LASSO and Multiple Linear Regression (MLR) to demonstrate how each model manages the complex and continuous nature of the wage variable:

- LASSO Regression: Applies a penalty to reduce the influence of less significant predictors, enhancing model simplicity and accuracy.

- MLR: Provides a baseline by including all predictors without regularization, illustrating potential overfitting issues.

Both models are applied to the same cleaned dataset to ensure a fair comparison:

  • Model Fitting: Each model is fitted using standardized features to prevent scale disparities from affecting the results.
  • Coefficient Extraction and Comparison: We compare the significance and magnitude of coefficients from both models to identify key predictors and assess the impact of regularization.
  • Model Efficiency: LASSO’s ability to simplify the model by eliminating non-significant predictors helps in reducing overfitting and enhancing interpretability.
  • Predictive Performance: By comparing the predictive accuracy of both models, we evaluate which model provides more reliable and robust predictions for wage data.
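The comparison steps above can be sketched on a held-out split; the 80/20 split, seed, and object names are illustrative and assume features_scaled and y from the earlier steps.

```r
library(glmnet)

set.seed(123)
idx     <- sample(seq_len(nrow(features_scaled)), size = 0.8 * nrow(features_scaled))
X_train <- features_scaled[idx, ];  y_train <- y[idx]
X_test  <- features_scaled[-idx, ]; y_test  <- y[-idx]

# MLR baseline: all predictors, no regularization.
mlr_fit  <- lm(y_train ~ ., data = as.data.frame(X_train))
mlr_pred <- predict(mlr_fit, newdata = as.data.frame(X_test))

# LASSO with cross-validated lambda.
cv_fit     <- cv.glmnet(X_train, y_train, alpha = 1)
lasso_pred <- predict(cv_fit, newx = X_test, s = "lambda.min")

# Compare out-of-sample root mean squared error.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(MLR = rmse(y_test, mlr_pred), LASSO = rmse(y_test, lasso_pred))
```

Comparing test-set RMSE rather than in-sample fit is what exposes MLR's tendency to overfit when many predictors are retained.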

Significant Predictors and Model Insights

The comparison of LASSO and MLR models highlights crucial predictors influencing wages, enhancing our understanding of the robustness of each statistical modeling approach.

A detailed examination of how variables are weighted differently by each model reveals significant insights:

  • Baseline and Overfitting: MLR often shows a tendency to overfit with complex datasets by retaining all predictors.
  • Variable Importance: LASSO highlights truly impactful variables by assigning zero to less important predictors, focusing analysis on factors that significantly affect wages.

A visual representation of coefficient differences effectively illustrates the impact of regularization:

  • Graphical Analysis: Charts display how each model values predictors, clarifying the practical effects of LASSO’s regularization.
  • Model Interpretation: These visuals aid in understanding which predictors are deemed most influential by each model and why.

Exploring the specific implications of our findings within the context of the Return to School dataset:

  • What We Analyzed: We focused on understanding how various educational, demographic, and work experience factors influence wage disparities in 1976.

  • Why It Matters: This analysis is crucial for identifying key areas where educational and economic policies can be targeted to reduce wage inequality.

  • How We Did It: Using LASSO and MLR, we were able to discern which variables significantly impact wages, with LASSO providing a more streamlined model that avoids overfitting and highlights the most impactful factors.

This analysis not only enhances academic understanding but also provides concrete data to inform policy makers:

  • Policy Recommendations: Insights from the study can guide the development of policies aimed at addressing the root causes of wage disparities identified through the model.
  • Educational Impact: By understanding which educational factors influence earnings, institutions can tailor programs to enhance the economic outcomes of their students.

Conclusion

Our comprehensive analysis using LASSO regression has identified pivotal factors that influenced wages in 1976, with a focus on the impact of educational attainment and age.

Figure 9: Impact of Continuous Variables on Wages
  • Educational Impact on Earnings: We found a strong correlation between higher education levels and higher wages, underscoring the substantial returns on educational investments. This insight supports the argument for increasing access to higher education as a means to elevate income levels.
  • Age and Earnings Correlation: Our results show that older age groups tend to earn more, which reflects the cumulative benefits of experience and ongoing education over time.
  • Selective Feature Retention: The LASSO model has proven effective in enhancing the clarity and focus of our analysis by selectively retaining features that have a significant impact on wages, thereby making the model not only easier to interpret but also more robust in its predictions.
  • Mitigation of Overfitting: By penalizing less significant predictors, LASSO helps maintain the reliability of our predictions, ensuring that our model performs well even on unseen data, which is crucial for making sound policy decisions.

This study opens the door for further research into additional socioeconomic factors that could affect wage disparities. Future studies could explore the impact of technological advances, economic policies, and other demographic changes on wage trends. Such research would help extend our understanding of the dynamics between education and wages over longer periods and under varying economic conditions.

We appreciate your attention and interest in our findings. We are now open to any questions you may have or discussions you would like to engage in. Your feedback and suggestions for further research areas are highly welcome.